Each of you has a study system your work in and a question of interest. Give an example of one variable that you would sample in order to get a sense of its variation in nature. Describe, in detail, how you would sample for the population of that variable in order to understand its distribution. Questions to consider include, but are not limited to: Just what is your sample versus your population? What would your sampling design be? Why would you design it that particular way? What are potential confounding influences of both sampling technique and sample design that you need to be careful to avoid? What statistical distribution might the variable take, and why?
One variable I would sample is Batrachochytrium dendrobatidis (Bd) fungal infection intensity of northern leopard frogs. In this case, my sample would be the swabs collected from individual amphibians at the study site, and the population would be all of the individuals at that site and their respective pathogen loads. I would sample this population by catching and swabbing frogs from stratified pond sections over multiple seasons, and then using qPCR to quantify fungal zoospore equivalents and calculate pathogen load, in number of zoospores. This approach would have validity because the Bd reads from qPCR reflect the number of zoospores– the infectious agent of interest. It is reliable because all amphibians would be swabbed the exact same way, to ensure that Bd load detected from different swabs is comparable and there is an equal chance of detection from all animals. I would try to make the sample as representative as possible by picking up variation that exists between individuals, habitat patches, and seasons. To do this, I would spend equal time at each stratified section of the pond, try to catch as many individuals as possible (ideal sample size is rarely reached because leopard frogs are challenging to catch), and visit over multiple seasons and years.
Despite my best efforts to sample thoroughly, there are still some confounding influences of this sample design and technique. First off, results may be biased by my ability to detect and catch individuals with a certain disease status. Heavily infected frogs might spend more time in the open and move more slowly than healthy animals – making them easier to catch. There can also be higher variability in pathogen load between years than between months or seasons, which means that (depending on the scale we want for our inference) sufficient long-term sampling may not be possible. Both of these issues would pose challenges to the collection of a representative sample. One other potential challenge is to sample validity: fewer zoospores are picked up from smaller animals but can represent equal disease load. It will be important to correct this for body mass to avoid misleading measurements of disease burden. The resulting data will likely have a bimodal distribution, with a spike around 0 for the uninfected individuals, and a bell curve centered around the mean number of zoospores from infected individuals.
Johns Hopkins has been maintaining one of the best Covid-19 timseries data sets out there. The data on the US can be found here with information about what is in the data at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data
Download and read in the data. Can you do this without downloading, but read directly from the archive (+1).
#read in data
covid_data <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv") %>%
dplyr::select(-Combined_Key,-Long_,-Lat,-Country_Region,-Admin2, -FIPS, -code3, -iso3, -iso2, -UID)
state_pop_data <- read_tsv("https://raw.githubusercontent.com/thedatachest/us_population/master/statepop.tsv") %>%
filter(Year== max(Year)) #get data from most recent year
#what's here
str(covid_data, list.len = 5)
## tibble [3,340 × 297] (S3: tbl_df/tbl/data.frame)
## $ Province_State: chr [1:3340] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ 1/22/20 : num [1:3340] 0 0 0 0 0 0 0 0 0 0 ...
## $ 1/23/20 : num [1:3340] 0 0 0 0 0 0 0 0 0 0 ...
## $ 1/24/20 : num [1:3340] 0 0 0 0 0 0 0 0 0 0 ...
## $ 1/25/20 : num [1:3340] 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
str(state_pop_data)
## tibble [51 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ State : chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Year : num [1:51] 2015 2015 2015 2015 2015 ...
## $ Population: num [1:51] 4858979 738432 6828065 2978204 39144818 ...
## - attr(*, "spec")=
## .. cols(
## .. State = col_character(),
## .. Year = col_double(),
## .. Population = col_double()
## .. )
The data is, well, huge. It’s also wide, with dates as columns. Write a function that, given a state, will output a time series (long data where every row is a day) of cummulative cases in that state as well as new daily cases.
Note, let’s make the date column that emerges a true date object. Let’s say you’ve called it date_col. If you mutate it, mutate(date_col = lubridate::mdy(date_col)), it will be turned into a date object that will have a recognized order. {lubridate} is da bomb, and I’m hoping we have some time to cover it in the future.
+5 extra credit for merging it with some other data source to also return cases per 100,000 people.
#data function
state_data <- function(state){
state_pop <- state_pop_data %>%
filter(State== state) %>% #filter to given state
summarize(Population) #get population size
covid_long <- covid_data %>%
filter(Province_State== state) %>% #filter to given state
pivot_longer(cols= -Province_State, #pivot so that dates are all in 1 column
names_to= "date",
values_to = "cumulative_cases") %>%
mutate(date=mdy(date)) %>% #reformat date with lubridate
group_by(date) %>% #for each date...
summarize(cumulative_cases=sum(cumulative_cases)) %>% #add cumulative cases to get total cumulative cases for the given state on each day
mutate(daily_cases= as.numeric(cumulative_cases - lag(cumulative_cases, 1))) %>% #calculate daily cases ((cumulative cases for day x) - (cumulative cases for day x-1))
mutate(daily_cases_per_10_mil= (daily_cases*10000000)/state_pop$Population) %>% #add column for daily cases per million people
filter(daily_cases >=0) %>% #remove errors from daily cases reported below 0 (a couple of dates in MA and possibly in other state reports as well)
mutate(cum_cases_per_100000= (cumulative_cases*100000)/state_pop$Population)} #add column for cumulative cases per million people
Mass_data <- state_data("Massachusetts") #test function
str(Mass_data, list.len = 5)
## tibble [294 × 5] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:294], format: "2020-01-23" "2020-01-24" ...
## $ cumulative_cases : num [1:294] 0 0 0 0 0 0 1 1 1 1 ...
## $ daily_cases : num [1:294] 0 0 0 0 0 0 1 0 0 0 ...
## $ daily_cases_per_10_mil: num [1:294] 0 0 0 0 0 ...
## $ cum_cases_per_100000 : num [1:294] 0 0 0 0 0 ...
max(Mass_data$cumulative_cases) #compare to cases reported by CDC to double check my work - this should check out
## [1] 180189
Great! Make a compelling plot of the timeseries for Massachusetts! Points for style, class, ease of understanding major trends, etc. Note, 10/10 only for the most killer figures. Don’t phone it in! Also, note what the data from JHU is. Do you want the cummulatives, or daily, or what?
Mass_data %>%
ggplot()+
geom_area(mapping=aes(x=date, y=daily_cases, fill=daily_cases), fill= "green", alpha=.5, color="black")+
geom_area(mapping=aes(x=date, y=cum_cases_per_100000, fill=cum_cases_per_100000), show.legend=TRUE, fill= "blue", alpha=.2, color="black")+
labs(title="Massachusetts covid cases",
colour="type",
x= "Date",
y="Cases",
subtitle="Daily cases (green) and cumulative cases per 100,000 residents (blue)")+
theme_bw()
Massachusetts daily covid cases and cumulative cases per 100,000 residents, based on covid data from JHU and TheDataChest’s state population data set.
Cool. Now, write a function that will take what you did above, and create a plot for any state - so, I enter Alaska and I get the plot for Alaska! +2 if it can do daily or cumulative cases - or cases per 100,000 if you did that above. +3 EC if you highlight points of interest - but dynamically using the data. Note, you might need to do some funky stuff to make things fit well in the plot for this one. Or, meh.
#single state plot function
single_state_plot<- function(state){
data <- state_data(state)
plot <- data %>%
ggplot()+
geom_area(data=data, mapping=aes(x=date, y=daily_cases), fill= "green", alpha=.5, color="black")+
geom_area(data=data, mapping=aes(x=date, y=cum_cases_per_100000), fill= "blue", alpha=.2, color="black")+
labs(title=state,
x= "Date",
y="Cases",
subtitle="State daily cases (green) and cumulative cases per 100,000 residents (blue)")+
theme_bw()
return(plot)}
#test single state plot
AK_plot <- single_state_plot("Alaska")
AK_plot
Use what you’ve done - or even new things (data sets, etc) so make compelling informative world-shaking visualizations that compare between states. Feel free to bring in outside information, aggregate things, or do whatever you would like. +5 per awesome viz (and Michael will be grading hard - if you don’t knock his socks off, it might only be +1) and +3 if you accompany it with a function that lets us play around and make new viz.
# Making multi-state plots of daily and cumulative cases/population, arranged by total number of cases for states
# Step 1: Make data frame with to-date cumulative cases by state - for ordering multi-state plots
covid_states <- covid_data %>%
pivot_longer(cols= -Province_State,
names_to= "date",
values_to = "cumulative_cases") %>%
mutate(date=mdy(date)) %>%
group_by(Province_State,date) %>%
summarize(cumulative_cases=sum(cumulative_cases)) %>%
group_by(Province_State) %>%
summarize(cumulative_cases=max(cumulative_cases)) %>%
arrange(desc(cumulative_cases)) %>% #sort in descending order of cumulative cases
#remove states from covid data that are not in pop data- otherwise this causes a problem when they go through the state_data() function within state_plot()
filter(Province_State != "American Samoa", Province_State!= "Diamond Princess", Province_State!= "Northern Mariana Islands", Province_State!= "Grand Princess", Province_State!= "Guam", Province_State!= "Puerto Rico", Province_State!= "Virgin Islands") %>%
ungroup()
str(covid_states, list.len = 5)
## tibble [51 × 2] (S3: tbl_df/tbl/data.frame)
## $ Province_State : chr [1:51] "Texas" "California" "Florida" "New York" ...
## $ cumulative_cases: num [1:51] 1024073 1004116 863619 545762 536542 ...
# Step 2: Merge state data with population data
state_data_for_merge <- state_pop_data %>%
rename(Province_State=State) %>% #rename this column to use in the merge
dplyr::select(!Year)
# new df with population data...
covid_states <- covid_states %>%
merge(state_data_for_merge, by="Province_State", sort=FALSE) %>%
mutate(cum_cases_per_100000= (cumulative_cases*100000)/Population)
str(covid_states, list.len = 5)
## 'data.frame': 51 obs. of 4 variables:
## $ Province_State : chr "Texas" "California" "Florida" "New York" ...
## $ cumulative_cases : num 1024073 1004116 863619 545762 536542 ...
## $ Population : num 27469114 39144818 20271272 19795791 12859995 ...
## $ cum_cases_per_100000: num 3728 2565 4260 2757 4172 ...
# Step 3: Get list of states in descending order of number of cases
covid_list <- c(covid_states$Province_State[1:51])
covid_list
## [1] "Texas" "California" "Florida"
## [4] "New York" "Illinois" "Georgia"
## [7] "Wisconsin" "North Carolina" "Tennessee"
## [10] "Ohio" "New Jersey" "Arizona"
## [13] "Michigan" "Pennsylvania" "Indiana"
## [16] "Missouri" "Alabama" "Minnesota"
## [19] "Louisiana" "Virginia" "South Carolina"
## [22] "Massachusetts" "Iowa" "Maryland"
## [25] "Colorado" "Oklahoma" "Utah"
## [28] "Mississippi" "Kentucky" "Arkansas"
## [31] "Washington" "Nevada" "Kansas"
## [34] "Nebraska" "Connecticut" "Idaho"
## [37] "New Mexico" "South Dakota" "North Dakota"
## [40] "Oregon" "Montana" "Rhode Island"
## [43] "West Virginia" "Delaware" "Alaska"
## [46] "Wyoming" "District of Columbia" "Hawaii"
## [49] "New Hampshire" "Maine" "Vermont"
# Step 4: Multi-state plot function - minimizes plot elements for better presentation
multi_state_plot<- function(state){
data <- state_data(state)
title <- paste(paste(state, ",", sep=""), max(data$cumulative_cases), "total cases", sep = " ")
plot <- data %>%
ggplot()+
geom_area(data=data, mapping=aes(x=date, y=daily_cases_per_10_mil), fill= "green", alpha=.4, color="black")+
geom_area(data=data, mapping=aes(x=date, y=cum_cases_per_100000), fill= "blue", alpha=.3, color="black")+
labs(title=title, x=NULL, y=NULL)+
scale_y_continuous(limits=c(0,max(covid_states$cum_cases_per_100000)))+ #uniform y-axis for states, based on state with most cumulative cases
theme_bw(base_size = 10)
return(plot)}
# Step 5: run state_plot() for each covid_list state value, using lapply()!
covid_plot_list <- lapply(covid_list, multi_state_plot)
# Step 6: all plots! Now in order of most cumulative cases to fewest cumulative cases!
do.call("grid.arrange", c(covid_plot_list,
ncol=5,
nrow=11,
bottom="Daily COVID-19 cases per 1 million residents (green) and cumulative cases per 100,000 residents (blue), by state, in order of most total cases to fewest total cases."))
We have discussed multiple inferential frameworks this semester. Frequentist NHST, Likelihood and model comparison, Baysian probabilistic thinking, Assessment of Predictive Ability (which spans frameworks!), and more. We’ve talked about Popper and Lakatos. Put these pieces of the puzzle together and look deep within yourself.
What do you feel is the inferential framework that you adopt as a scientist? Why? Include in your answer why you prefer the inferential tools (e.g. confidence intervals, test statistics, out-of-sample prediction, posterior probabilities, etc.) of your chosen worldview and why you do not like the ones of the other one. This includes defining just what those different tools mean, as well as relating them to the things you study. extra credit for citing and discussing outside sources - one point per source/point
I prefer inferential frameworks based on likelihood, particularly Bayes. Conceptually, I like how these methods allow us to assess the possibility of a hypothesis in terms of degree of belief. Bayes is particularly appealing because incorporating prior knowledge into an estimate for a value (a hypothesis) feels more intuitive than testing null hypotheses with only the data on hand. In the context of my study system, I appreciate how Bayes allows for the interpretation of an estimate as something that is probabilistically aligned with a range of values, rather than a single value that we attempt to estimate. This means that my variables of interest, such as amphibian Bd infection prevalence, do not have to be treated as a constant single value for the population over time. Instead, they can be a treated as a range in which normal fluctuation around a mean occurs within a given timeframe. For multi-season studies, I like that Bayesian approaches make it possible to consider data across years to identify the most likely patterns, accounting for prior knowledge from past field seasons. I also find the Bayesian credible interval easier to interpret than a frequentist confidence interval because it represents the span of parameter values at which there is a probability of the actual value existing given the prior and data, whereas the frequentist confidence interval describes the uncertainty of the range of parameter values, but cannot be interpreted as probability of the actual values (Skharat on StackExchange). As far as applying data to decision-making, it’s nice that we can use the distribution of predictions from likelihood estimates to visualize posteriors in Bayesian models and compare how different degrees of belief influence interpretation of the fit. This is good for conservation planning, because the likelihood of different values can influence management decisions that have tradeoffs. In a least-squares model with no likelihood dimension, I think it would be more difficult to make a nuanced decision based on a model. Lastly, I like that Bayes propagates uncertainty rather than basing conclusions on a single fit value with uncompounded uncertainty (as in likelihood or least-squares). Given the choice of Lakanto’s research program vs. strict Popperian Falsification, I prefer Lakatos’s view because the end product– a robust theory based on falsifiable hypotheses– seems more useful than null hypothesis testing on its own. I imagine that an inductive theory or core belief, when well-supported, will be more helpful for turning science and logic into meaningful action.
I’ve referenced the following figure a few times. I’d like you to demonstrate your understanding of Bayes Theorem by hand (e.g. calculate it out and show your work - you can do this all in R, I’m not a monster) showing what is the probability of the sun exploding is given that the device said yes. Assume that your prior probability that the sun explodes is p(Sun Explodes) = 0.0001 (I’ll leave it to you to get p(Sun Doesn’t Explode). The rest of the information you need - and some you don’t - is in the cartoon - p(Yes | Explodes), p(Yes | Doesn’t Explode), p(No | Explodes), p(No | Doesn’t Explode).
If S= sun explosion and R= result…
The probability of the sun exploding, given the result, is equal to the chance of a true positive (P(R|S)) multiplied by the chance of the sun exploding (given by the prior = 0.0001), divided by the probability of any positive result– including a false positive:
P(S|R) = (P(R|S)sun_prior) / (P(R|S)sun_prior + P(R|1-S)*(1-sun_prior))
sun_prior <- 0.0001 #P(S) the probability that the sun explodes
false_positive <- 0.027 #P(R|1-S) the probability of the positive result given that the sun did not explode
true_positive <- (1 - false_positive) #P(R|S) the probability of the positive result given that the sun did explode
#Therefore, the probability that the sun did explode, given the result, is:
explode_prob <- (sun_prior*true_positive) / ((sun_prior*true_positive) + ((1-sun_prior)*false_positive))
explode_prob
## [1] 0.003591121
There is a 0.0035911 chance that the sun has exploded, given the result.
Why is this a bad parody of frequentist statistics?
This comic is a bad parody of frequentest statistics because the p-value only represents the probability of the null hypothesis given the data, not the probability that the alternate hypothesis is true. The statistician should not conclude that the sun has exploded, but rather that there is a 0.027 chance of the observed data, given the null hypothesis of no explosion.
I’d like us to walk through the three different ‘engines’ that we have learned about to fit linear models. To motivate this, we’ll look at Burness et al.’s 2012 study "Post-hatch heat warms adult beaks: irreversible physiological plasticity in Japanese quail http://rspb.royalsocietypublishing.org/content/280/1767/20131436.short the data for which they have made available at Data Dryad at http://datadryad.org/resource/doi:10.5061/dryad.gs661. We’ll be looking at the morphology data.
#start by exploring the data and removing NA values for tarsus and culmen
#read in data
quail_data <- read_csv("midterm_quail_data/Morphology data.csv") %>%
janitor::clean_names()
#what's here
str_quail <- str(quail_data)
## tibble [880 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ bird_number : num [1:880] 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : chr [1:880] NA "m" "f" "m" ...
## $ age_days : num [1:880] 5 5 5 5 5 5 5 5 5 5 ...
## $ exp_temp_degree_c: num [1:880] 15 15 15 15 15 15 15 15 15 15 ...
## $ mass_g : num [1:880] 16.1 19.2 17.5 14.4 17.4 ...
## $ tarsus_mm : num [1:880] 19.4 20.4 19 20.1 21.8 ...
## $ culmen_mm : num [1:880] 7.64 7.49 7.31 7.34 8.24 6.82 7.84 7.39 7.4 7.81 ...
## $ depth_mm : num [1:880] 4.23 4.46 3.92 3.85 4.42 3.65 3.94 3.72 4.5 4.09 ...
## $ width_mm : num [1:880] 4.49 4.44 4.01 4.22 4.56 3.73 4.6 4.66 3.83 3.89 ...
## $ notes : chr [1:880] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. `Bird #` = col_double(),
## .. Sex = col_character(),
## .. `Age (days)` = col_double(),
## .. `Exp. Temp. (degree C)` = col_double(),
## .. `Mass (g)` = col_double(),
## .. `Tarsus (mm)` = col_double(),
## .. `Culmen (mm)` = col_double(),
## .. `Depth (mm)` = col_double(),
## .. `Width (mm)` = col_double(),
## .. NOTES = col_character()
## .. )
#check for NAs
visdat::vis_miss(quail_data) #NAs in tarsus and culmen
#remove data points with NA for tarsus or culmen
quail_data <- quail_data %>%
filter(!is.na(tarsus_mm),!is.na(culmen_mm))
#check for NAs again
visdat::vis_miss(quail_data) #no more NAs in columns of interest
#visualize data
#plot culmen ~ tarsus
ggplot(quail_data, mapping=aes(x=tarsus_mm, y=culmen_mm))+
geom_point()+
stat_smooth(method="lm", se=FALSE)+ #just to visualize linearity- looks close
theme_bw()
To begin with, I’d like you to fit the relationship that describes how Tarsus (leg) length predicts upper beak (Culmen) length. Fit this relationship using least squares, likelihood, and Bayesian techniques. For each fit, demonstrate that the necessary assumptions have been met. Note, functions used to fit with likelihood and Bayes may or may not behave well when fed NAs. So look out for those errors.
In addition to their specific requirements, all three of these models must meet the basic assumptions for linear regression analysis:
Validity - The data must reflect the variables of interest. In this case, the variables are the morphometrics themselves, so it should be valid.
Representativeness - The data should represent the population. Based on the background provided by the study, this seems to be a representative sample of the quail in the experimental system. They measured 36/40 birds of the age of interest, and deformed-beak individuals were excluded because they were not part of the population of interest (normal birds, that is).
Model captures features in the data - See simulations of models below to compare with observed data. They should align.
Additivity and Linearity - The above plot of the variables appears to be additive and linear, although there is a chance that the relationship is exponential. Values are over-predicted at the lower end and under-predicted at the upper end.
Independence of Errors - Replicates must be truly independent. Eggs were split into two chambers early in the study, and then treatments were applied to groups of birds. Due to groupings, individual birds might not technically be independent…
Equal Variance of Errors - See plots of residuals vs fitted values for each fit, below. Ideally, there would be no relationship.
Normality of Errors - See qqplots for each fit, below. They should follow the line.
Minimal Outlier Influence - See cook’s distance plots for each fit, below. Ideally, there will be no values >1
additional assumptions are tested for likelihood and Bayes, below
# LEAST SQUARES - fit lm ####
quail_lm <- lm(culmen_mm~tarsus_mm, data=quail_data)
# Check assumptions ####
#Equal variance and normality of errors:
plot(quail_lm, which=1) # # maybe a bit heteroscedastic
plot(quail_lm, which=2) # strays a bit below the line at the bottom and above at the top-- data could have long-ish tails, but it is not too far from the line
hist(residuals(quail_lm)) # residuals look normal
shapiro.test(residuals(quail_lm)) # residuals technically fail normality test
##
## Shapiro-Wilk normality test
##
## data: residuals(quail_lm)
## W = 0.99384, p-value = 0.003158
#Minimal outlier influence:
plot(quail_lm, which=4) # no values >1
#Model captures features in the data:
# simulate data and plot against real data - looks good!
quail_sims <- simulate(quail_lm, nsim= 30) %>%
pivot_longer(
cols=everything(),
names_to= "sim",
values_to = "length.cm"
)
ggplot()+
geom_density(data=quail_sims,
mapping=aes(x=length.cm, group=sim),
size= 0.2)+
geom_density(data= quail_sims,
mapping=aes(x=length.cm),
size=2, color="blue")
# Evaluate model! ####
confint(quail_lm) #tarsus slope does not cross 0
## 2.5 % 97.5 %
## (Intercept) -0.5216505 0.3242363
## tarsus_mm 0.3598809 0.3859727
summary(quail_lm) #tarsus slope is significant
##
## Call:
## lm(formula = culmen_mm ~ tarsus_mm, data = quail_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4081 -0.7029 -0.0328 0.7263 3.5970
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.098707 0.215450 -0.458 0.647
## tarsus_mm 0.372927 0.006646 56.116 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.238 on 764 degrees of freedom
## Multiple R-squared: 0.8048, Adjusted R-squared: 0.8045
## F-statistic: 3149 on 1 and 764 DF, p-value: < 2.2e-16
# LIKELIHOOD - fit glm ####
quail_glm <- glm(culmen_mm~tarsus_mm,
data=quail_data,
family = gaussian(link="identity"))
# Check assumptions ####
#Equal variance and normality of errors:
plot(quail_lm, which=1) # # maybe a bit heteroscedastic
plot(quail_lm, which=2) # strays a bit below the line at the bottom and above at the top-- data could have long-ish tails, but it is not too far from the line
hist(residuals(quail_lm)) # residuals look normal
shapiro.test(residuals(quail_lm)) # residuals technically fail normality test
##
## Shapiro-Wilk normality test
##
## data: residuals(quail_lm)
## W = 0.99384, p-value = 0.003158
#Minimal outlier influence:
plot(quail_glm, which=4) # no values >1
#Model captures features in the data:
# simulate data and plot against real data - looks good!
quail_sims <- simulate(quail_glm, nsim= 30) %>%
pivot_longer(
cols=everything(),
names_to= "sim",
values_to = "length.cm"
)
ggplot()+
geom_density(data=quail_sims,
mapping=aes(x=length.cm, group=sim),
size= 0.2)+
geom_density(data= quail_sims,
mapping=aes(x=length.cm),
size=2, color="blue")
#check profiles for intercept and tarsus_mm - they are nice parabolas centered around
#the values, so this looks good.
prof <- profileModel(quail_glm,
objective= "ordinaryDeviance")
## Profiling for parameter (Intercept) ... Done
## Profiling for parameter tarsus_mm ... Done
plot(prof, print.grid.points = TRUE)
# Evaluate model! ####
confint(quail_glm) #tarsus slope does not cross 0
## 2.5 % 97.5 %
## (Intercept) -0.5209805 0.3235663
## tarsus_mm 0.3599015 0.3859520
summary(quail_glm) #tarsus slope is significant
##
## Call:
## glm(formula = culmen_mm ~ tarsus_mm, family = gaussian(link = "identity"),
## data = quail_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4081 -0.7029 -0.0328 0.7263 3.5970
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.098707 0.215450 -0.458 0.647
## tarsus_mm 0.372927 0.006646 56.116 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.533593)
##
## Null deviance: 6000.9 on 765 degrees of freedom
## Residual deviance: 1171.7 on 764 degrees of freedom
## AIC: 2505.4
##
## Number of Fisher Scoring iterations: 2
color_scheme_set("viridis") #make things look nice
# BAYES - fit brm ####
quail_brm <- brm(culmen_mm~tarsus_mm,
data=quail_data,
family = gaussian(link="identity"),
chains=3,
seed=802)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
## ^
## ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
## ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
##
## SAMPLING FOR MODEL 'e754aa52dd0feb2be6d67820a7a0fe08' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 2.2e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.024321 seconds (Warm-up)
## Chain 1: 0.02242 seconds (Sampling)
## Chain 1: 0.046741 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'e754aa52dd0feb2be6d67820a7a0fe08' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1.2e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.024533 seconds (Warm-up)
## Chain 2: 0.025338 seconds (Sampling)
## Chain 2: 0.049871 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'e754aa52dd0feb2be6d67820a7a0fe08' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 3e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.3 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.024463 seconds (Warm-up)
## Chain 3: 0.022903 seconds (Sampling)
## Chain 3: 0.047366 seconds (Total)
## Chain 3:
# Check assumptions ####
#are chains aligned with one another for each estimate?
plot(quail_brm)
mcmc_trace(quail_brm) #yes, chains are aligned
#look at diagnostic of convergence
rhat(quail_brm) #Gelman_Rubin statistic (Rhat) is close to 1
## b_Intercept b_tarsus_mm sigma lp__
## 0.9996609 0.9996993 0.9999259 0.9999015
rhat(quail_brm) %>% mcmc_rhat() #in plot, we can see that all Rhats are left of the dashed line (good)
#assess autocorrelation
mcmc_acf(quail_brm) #It starts at 1 (totally correlated), and then drops to 0-- so, no autocorrelation!
#check the match between out data and our chains for dist of y
pp_check(quail_brm, "dens_overlay") #does not quite fit...
#is our error normal?
pp_check(quail_brm, "error_hist", bins=10) #looks normal
#equal variance of errors - see fitted vs residuals:
quail_res <- residuals(quail_brm) %>% #first get residuals
as_tibble #make df
quail_fit <- fitted(quail_brm) %>% #then get fitted values
as_tibble
plot(y=quail_res$Estimate, x=quail_fit$Estimate) # maybe a bit heteroscedastic
# Evaluate model! ####
summary(quail_brm) #tarsus slope is significant, CI does not cross 0
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: culmen_mm ~ tarsus_mm
## Data: quail_data (Number of observations: 766)
## Samples: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 3000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -0.10 0.21 -0.52 0.32 1.00 3451 2080
## tarsus_mm 0.37 0.01 0.36 0.39 1.00 3537 2364
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 1.24 0.03 1.18 1.30 1.00 2465 2171
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Do the coefficients and their associated measures of error in their estimation match? How would we interpret the results from these different analyses differently? Or would we? Note, confint works on lm objects as well.
The coefficient estimates and standard error for least squares, likelihood, and bayesian models were all the same. These methods arrived at the same place, but the interpretation of the parameter values is not the same: In least squares, the coefficient represents the parameter values in the line formula that minimize the sum of the squares of the residuals; in likelihood, it is the most likely value (hypothesis) given our data; in bayes, they represent the most likely parameter values (hypothesis), given both our prior knowledge and our current data.
Because these models describe association between two variables (rather than prediction or counterfactuals), the results from all models suggest that a 1 mm increase in culmen length is associated with a 0.37 mm increase in tarsus length.
For your likelihood fit, are your profiles well behaved? For just the slope, use grid sampling to create a profile. You’ll need to write functions for this, sampling the whole grid of slope and intercept, and then take out the relevant slices as we have done before. Use the results from the fit above to provide the reasonable bounds of what you should be profiling over (3SE should do). Is it well behaved? Plot the profile and give the 80% and 95% CI (remember how we use the chisq here!). Verify your results with profileModel.
Yes, profiles for the likelihood fit are well behaved (see below). The likelihood surface from grid sampling is well behaved- the only caveat is that the profile of the standard deviation of the slope estimate has a peak to the left of the center. The 80% CI for the slope estimate is 0.3511706 - 0.3812709, and the 95% CI is 0.3678930 - 0.4247492. These results are similar to those from profileModel().
#(prof was calculated in 5a)
plot(prof, print.grid.points = TRUE) #even parabola = well behaved profiles
#tau
prof_quail <- profile(quail_glm)
plot(prof_quail) #straight line = well behaved profiles
#confidence interval - slope for tarsus does not cross 0!
confint(prof_quail)
## 2.5 % 97.5 %
## (Intercept) -0.5209805 0.3235663
## tarsus_mm 0.3599015 0.3859520
#use values from glm summary to inform likelihood surface:
#tarsus slope estimate from glm= 0.37293
#tarsus slope standard error from glm = 0.006646
#calculate SD of slope estimate to use in likelihood surface:
quail_n <- length(quail_data) #sample size
quail_se <- 0.006646 #standard error of estimate
quail_sd <- quail_n*quail_se #standard deviation= 0.06646
#get random sample with mean and SD of slope estimate for likelihood surface
samp <- rnorm(20,mean=0.37293, sd=0.06646)
#function for likelihood
norm_lik <- function(m, s){
dnorm(samp, mean = m, sd = s, log = FALSE) %>% prod()
}
#function for log-likelihood
norm_loglik <- function(m, s){
dnorm(samp, mean = m, sd = s, log = TRUE) %>% sum()
}
#grid sample for slope
lik_df_norm <- crossing(m=seq(0,1, length.out=300),
s=seq(0,.35, length.out=300)) %>%
group_by(m,s) %>%
mutate(lik=norm_lik(m,s),
loglik= norm_loglik(m,s),
deviance=-2*loglik) %>%
ungroup()
#visualize
ggplot(lik_df_norm %>% filter(loglik>max(loglik)-5),
aes(x=m, y=s, z=loglik))+
geom_contour_filled(bins=20)
#MLE of parameters
lik_df_norm %>% filter(deviance == min(deviance))
## # A tibble: 1 x 5
## m s lik loglik deviance
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.358 0.0737 23106731036. 23.9 -47.7
#get slice across values of m (profile for slope estimate mean)
like_prof_m <- lik_df_norm %>%
group_by(m) %>%
filter(deviance == min(deviance)) %>%
ungroup()
#get slice across values of s (profile for slope estimate sd)
like_prof_s <- lik_df_norm %>%
group_by(s) %>%
filter(deviance == min(deviance)) %>%
ungroup()
#plot profiles:
#slope estimate mean
ggplot(like_prof_m %>% filter(loglik>max(loglik)-5),
aes(x=m, y=loglik))+
geom_point()
#slope estimate sd
ggplot(like_prof_s %>% filter(loglik>max(loglik)-5),
aes(x=s, y=loglik))+
geom_point()
#95% CI for slope
CI_data_frame_95 <- lik_df_norm %>%
filter(loglik>=max(loglik) - qchisq(0.95, df=1)/2) %>%
as.data.frame()
#View(CI_data_frame_95) #95% CI= 0.3678930 - 0.4247492
#80% CI for slope
CI_data_frame_80 <- lik_df_norm %>%
filter(loglik>=max(loglik) - qchisq(0.8, df=1)/2) %>%
as.data.frame()
#View(CI_data_frame_80) #80% CI= 0.3511706 - 0.3812709
#check using profileModel()
prof_ci <- profileModel(quail_glm,
objective= "ordinaryDeviance",
quantile= qchisq(0.95, 1))
## Preliminary iteration .. Done
##
## Profiling for parameter (Intercept) ... Done
## Profiling for parameter tarsus_mm ... Done
plot(prof_ci)
This data set is pretty big. After excluding NAs in the variables we’re interested in, it’s over 766 lines of data! Now, a lot of data can overwhelm a strong prior. But only to a point. Show first that there is enough data here that a prior for the slope with an estimate of 0.7 and a sd of 0.01 is overwhelmed and produces similar results to the default prior. How different are the results from the original?
The prior is overwhelmed by the data (it falls outside of 95% CI, so it is very unlikely). The posterior estimate of tarsus for the model with the 0.7 prior is 0.5040553, as opposed to the estimate of 0.3730030 from the model with a default prior. So, with the prior of 0.7, the estimate is 35% greater than the default.
Second, randomly sample 10, 100, 300, and 500 data points. At which level is our prior overwhelmed (e.g., the prior slope becomes highly unlikely)? Communicate that visually in the best way you feel gets the point across, and explain your reasoning. +4 for a function that means you don’t have to copy and paste the model over and over. + 4 more if you use map() in combination with a tibble to make this as code-efficient as possible. This will also make visualization easier.
Our prior becomes highly unlikely by the time we add about 100 data points from this data set to the model. I visualized this with half-eye plots of the posterior probability distribution for models with the 10, 100, 300, and 500 data points, using a red line to indicate the prior value of 0.7. This makes it easy to see where the prior value falls along the CI for the posterior estimate.
#confirm that: yes, I have 766 lines of data
length(quail_data$tarsus_mm)
## [1] 766
#1. show that prior with m=0.7, sd= 0.01 is overwhelmed
#fit model with the prior
quail_brm_prior <- brm(culmen_mm~tarsus_mm,
data=quail_data,
family = gaussian(link="identity"),
chains=3,
prior=c(prior(coef="tarsus_mm",
prior=normal(0.7, 0.01))))
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
## ^
## ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
## ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
##
## SAMPLING FOR MODEL '5d23065a97a2635e9e674d05e729ca41' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 2.4e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.031559 seconds (Warm-up)
## Chain 1: 0.028214 seconds (Sampling)
## Chain 1: 0.059773 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '5d23065a97a2635e9e674d05e729ca41' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1.1e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.033218 seconds (Warm-up)
## Chain 2: 0.026921 seconds (Sampling)
## Chain 2: 0.060139 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '5d23065a97a2635e9e674d05e729ca41' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 1.1e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.029213 seconds (Warm-up)
## Chain 3: 0.027318 seconds (Sampling)
## Chain 3: 0.056531 seconds (Total)
## Chain 3:
fixef(quail_brm) # model with default prior, from 5a. Posterior = 0.3730030
## Estimate Est.Error Q2.5 Q97.5
## Intercept -0.1019235 0.214475743 -0.5238951 0.3223449
## tarsus_mm 0.3730030 0.006671086 0.3599324 0.3860432
fixef(quail_brm_prior) # the prior is overwhelmed by the data (it falls outside of 95% CI - very unlikely). The posterior estimate for the model with the 0.7 prior is 0.5040553, as opposed to the estimate of 0.3730030 from the model with a default prior
## Estimate Est.Error Q2.5 Q97.5
## Intercept -4.2503104 0.264627489 -4.7703772 -3.7536407
## tarsus_mm 0.5038107 0.008096469 0.4885888 0.5196971
#2. randomly sample 10, 100, 300, and 500 points from the data set- at what point does the prior become overwhelmed (prior slope becomes highly unlikely)?
#make function to sample x values from quail_data and then fit a bayes model
quail_subsample_bayes <- function(number){
quail_data_n <- slice_sample(.data=quail_data, n=number, replace=TRUE)
quail_brm_subsample <- brm(culmen_mm~tarsus_mm,
data=quail_data_n,
family = gaussian(link="identity"),
chains=3,
prior=c(prior(coef="tarsus_mm",
prior=normal(0.7, 0.01))))
return(quail_brm_subsample)}
#for 10 random values...
quail_subsample_10 <- quail_subsample_bayes(10)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
## ^
## ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
## ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
##
## SAMPLING FOR MODEL 'a395f65fdde3aac405fb42a9efd035ae' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 1.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.036686 seconds (Warm-up)
## Chain 1: 0.016236 seconds (Sampling)
## Chain 1: 0.052922 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'a395f65fdde3aac405fb42a9efd035ae' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 3.6e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.36 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.025292 seconds (Warm-up)
## Chain 2: 0.015097 seconds (Sampling)
## Chain 2: 0.040389 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'a395f65fdde3aac405fb42a9efd035ae' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 8e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.029988 seconds (Warm-up)
## Chain 3: 0.014729 seconds (Sampling)
## Chain 3: 0.044717 seconds (Total)
## Chain 3:
plot(quail_subsample_10)
fixef(quail_subsample_10) #tarsus estimate = 0.6981741, 95% CI includes the prior (not overwhelmed)
## Estimate Est.Error Q2.5 Q97.5
## Intercept -10.1954173 1.0244294 -12.2198924 -8.1411956
## tarsus_mm 0.6979718 0.0105003 0.6785333 0.7181052
#for 100 random values...
quail_subsample_100<- quail_subsample_bayes(100)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
## ^
## ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
## ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
##
## SAMPLING FOR MODEL '080dc755d7d094ba738721fbb49721d4' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 3.4e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.34 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.020153 seconds (Warm-up)
## Chain 1: 0.015453 seconds (Sampling)
## Chain 1: 0.035606 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '080dc755d7d094ba738721fbb49721d4' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.1 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.019395 seconds (Warm-up)
## Chain 2: 0.014247 seconds (Sampling)
## Chain 2: 0.033642 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '080dc755d7d094ba738721fbb49721d4' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 7e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.07 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.020278 seconds (Warm-up)
## Chain 3: 0.015072 seconds (Sampling)
## Chain 3: 0.03535 seconds (Total)
## Chain 3:
plot(quail_subsample_100)
fixef(quail_subsample_100) #tarsus estimate = 0.6757016, prior value of 0.7 is at very edge of 95% CI (getting overwhelmed)
## Estimate Est.Error Q2.5 Q97.5
## Intercept -9.8115123 0.38500347 -10.5843929 -9.0687639
## tarsus_mm 0.6756143 0.01011407 0.6557079 0.6958958
#for 300 random values...
quail_subsample_300<- quail_subsample_bayes(300)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
## ^
## ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
## ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
##
## SAMPLING FOR MODEL '07791d6bc5ace2a815ccf5516613fd56' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 4.4e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.44 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.022773 seconds (Warm-up)
## Chain 1: 0.016783 seconds (Sampling)
## Chain 1: 0.039556 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '07791d6bc5ace2a815ccf5516613fd56' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1.1e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.023336 seconds (Warm-up)
## Chain 2: 0.018633 seconds (Sampling)
## Chain 2: 0.041969 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '07791d6bc5ace2a815ccf5516613fd56' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 9e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.020828 seconds (Warm-up)
## Chain 3: 0.021237 seconds (Sampling)
## Chain 3: 0.042065 seconds (Total)
## Chain 3:
plot(quail_subsample_300)
fixef(quail_subsample_300) #tarsus estimate = 0.6286125, prior value of 0.7 is outside of 95% CI (now overwhelmed)
## Estimate Est.Error Q2.5 Q97.5
## Intercept -8.2802642 0.34554813 -8.9473379 -7.5923555
## tarsus_mm 0.6267617 0.01008848 0.6071412 0.6466348
#for 500 random values...
quail_subsample_500<- quail_subsample_bayes(500)
## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/unsupported" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/src/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppParallel/include/" -I"/Library/Frameworks/R.framework/Versions/4.0/Resources/library/rstan/include" -DEIGEN_NO_DEBUG -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -DSTAN_THREADS -DBOOST_NO_AUTO_PTR -include '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp' -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1 -I/usr/local/include -fPIC -Wall -g -O2 -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
## ^
## ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
## ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1
##
## SAMPLING FOR MODEL '9446246a6b22035993032c50ed143e7b' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 1.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.028056 seconds (Warm-up)
## Chain 1: 0.022936 seconds (Sampling)
## Chain 1: 0.050992 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '9446246a6b22035993032c50ed143e7b' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 9e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.09 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.028933 seconds (Warm-up)
## Chain 2: 0.021986 seconds (Sampling)
## Chain 2: 0.050919 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '9446246a6b22035993032c50ed143e7b' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 8e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.025769 seconds (Warm-up)
## Chain 3: 0.021476 seconds (Sampling)
## Chain 3: 0.047245 seconds (Total)
## Chain 3:
plot(quail_subsample_500)
fixef(quail_subsample_500) #tarsus estimate = 0.5668878, prior value of 0.7 is well outside of 95% CI (overwhelmed)
## Estimate Est.Error Q2.5 Q97.5
## Intercept -6.4333237 0.34870676 -7.1079978 -5.7557123
## tarsus_mm 0.5717941 0.01074628 0.5509582 0.5927897
#3. communicate visually
#use gather_draws to get probability of tarsus values from each model
quail_subsample_10_draws <- quail_subsample_10 %>%
gather_draws(b_tarsus_mm)
quail_subsample_100_draws <- quail_subsample_100 %>%
gather_draws(b_tarsus_mm)
quail_subsample_300_draws <- quail_subsample_300 %>%
gather_draws(b_tarsus_mm)
quail_subsample_500_draws <- quail_subsample_500 %>%
gather_draws(b_tarsus_mm)
# plot posteriors and prior to see where the prior falls on the CI
plot_10 <- ggplot(quail_subsample_10_draws,
aes(x=.value))+
stat_halfeye(.width=c(0.9, 0.95), fill="sky blue", alpha=0.5)+ #posterior
labs(title="10 data points", x="tarsus_mm estimate", y= "probability")+
geom_vline(xintercept=0.7, color="red", size=1, linetype= "dotted")+ #prior value
theme_bw()
plot_100 <- ggplot(quail_subsample_100_draws,
aes(x=.value))+
stat_halfeye(.width=c(0.9, 0.95), fill="sky blue", alpha=0.5)+
labs(title="100 data points", x="tarsus_mm estimate", y= "probability")+
geom_vline(xintercept=0.7, color="red", size=1, linetype= "dotted")+
theme_bw()
plot_300 <- ggplot(quail_subsample_300_draws,
aes(x=.value))+
stat_halfeye(.width=c(0.9, 0.95), fill="sky blue", alpha=0.5)+
labs(title="300 data points", x="tarsus_mm estimate", y= "probability")+
geom_vline(xintercept=0.7, color="red", size=1, linetype= "dotted")+
theme_bw()
plot_500 <- ggplot(quail_subsample_500_draws,
aes(x=.value))+
stat_halfeye(.width=c(0.9, 0.95), fill="sky blue", alpha=0.5)+
labs(title="500 data points", x="tarsus_mm estimate", y= "probability")+
geom_vline(xintercept=0.7, color="red", size=1, linetype= "dotted")+
theme_bw()
#arrange all plots together for comparison - we can see the prior becomes very unlikely with 100 data points from the data set incoporated into the model.
grid.arrange(plot_10,plot_100, plot_300, plot_500)
There is some interesting curvature in the culmen-tarsus relationship. Is the relationship really linear? Squared? Cubic? Exponential? Use one of the cross-validation techniques we explored to show which model is more predictive. Justify your choice of technique. Do you get a clear answer? What does it say?
It looks like the relationship is exponential. I used AIC to compare the 4 models because it uses log-likelihood and the number of parameters to evaluate predictive ability as an estimate of out-of-sample deviance, penalizing over-fit models. The exponential model ranks much better than the other fits, with a lower AIC value than the second-best model by 30 units. Since this difference is greater than 10, we can tell that the cubed, linear, and squared models are very unlikely and the exponential model is favored.
#fit linear, squared, cubic, and exponential models of culmen~tarsus
linear_quail <- glm(culmen_mm~tarsus_mm,
data= quail_data,
family = gaussian(link="identity"))
squared_quail <- glm(culmen_mm~poly(tarsus_mm,2),
data= quail_data,
family = gaussian(link="identity"))
cubed_quail <- glm(culmen_mm~poly(tarsus_mm,3),
data= quail_data,
family = gaussian(link="identity"))
exponential_quail <- glm(culmen_mm~tarsus_mm,
data= quail_data,
family = Gamma(link="identity"))
#compare predictive ability of models with AIC
mod_list <- list(linear_quail,squared_quail,cubed_quail, exponential_quail)
name_vec <- c("linear", "squared", "cubed", "exponential")
aic_table <- aictab(cand.set = mod_list, modnames = name_vec) #clear winner is the exponential model
aic_table